In Triton, the fundamental unit of execution shifts from the CUDA scalar thread to the Program Instance. This represents an abstraction of a GPU thread block, where a single instance handles a vectorized "block" of elements simultaneously.
1. The Program Instance Identity
Every program instance retrieves its identity via pid = tl.program_id(axis=0). Think of a warehouse forklift (the program instance) picking up a pallet (the block) of 128 boxes at once, versus a single worker (a CUDA thread) carrying one box.
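The forklift picture can be sketched in plain Python. BLOCK_SIZE, n_elements, and block_offsets are illustrative names, not Triton API; inside a real kernel the instance id comes from tl.program_id(axis=0) and the offsets from tl.arange:

```python
BLOCK_SIZE = 128   # one "pallet": elements handled per program instance
n_elements = 1000  # total "boxes" to move

# Ceiling division: how many instances (forklifts) the launch grid needs.
num_instances = (n_elements + BLOCK_SIZE - 1) // BLOCK_SIZE

def block_offsets(pid):
    # Each instance derives its slice of the data purely from its id,
    # mirroring: pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    return [pid * BLOCK_SIZE + i for i in range(BLOCK_SIZE)]

print(num_instances)         # 8 instances cover 1000 elements
print(block_offsets(2)[:3])  # [256, 257, 258]
```

Note that the last instance overruns the data (8 × 128 = 1024 > 1000); real kernels handle this ragged edge with a mask.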
2. Triton vs. PyTorch Tensors
Understanding the semantic gap is crucial for memory management:
- PyTorch Tensor: A host-side Python object wrapping VRAM storage, strides, and metadata.
- Triton Tensor: A compiler-level object representing values or pointers residing in registers or SRAM.
- PyTorch view: a Python object pointing to contiguous global memory.
- Triton view: a 1D or 2D block of data held in compiler-managed registers.
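The host-side half of this gap can be made concrete with plain PyTorch (the tensor values below are illustrative): the tensor object carries shape and stride metadata in Python, while the storage it wraps is just raw memory addressed by a pointer, which is all a Triton kernel ever sees.

```python
import torch

# A PyTorch tensor is a host-side Python object: the data lives in raw
# storage, while shape and strides are metadata on the wrapper.
x = torch.arange(12, dtype=torch.float32).reshape(3, 4)
print(x.stride())    # (4, 1): row-major layout metadata
print(x.data_ptr())  # raw address of the underlying storage

# A view changes only the metadata; the storage pointer is shared.
v = x[:, ::2]
assert v.data_ptr() == x.data_ptr()
assert v.stride() == (4, 2)
```

When a tensor is passed to a Triton kernel, only that raw pointer crosses the boundary; tl.load then materializes values from it into a register-resident Triton tensor.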
3. SPMD Nature
Triton follows a Single Program, Multiple Data (SPMD) model: every program instance executes the same code, and behavior diverges only where that code uses the pid to compute instance-specific memory offsets.
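The three ideas above come together in the classic vector-add kernel, sketched below under the assumption that Triton is installed and a CUDA device is available (add_kernel and the 128-element block size are illustrative choices): every instance runs the identical body, and only the pid-derived offsets differ.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Every instance executes this same body (SPMD)...
    pid = tl.program_id(axis=0)
    # ...and diverges only through its pid-derived offsets.
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

if torch.cuda.is_available():
    x = torch.randn(1000, device="cuda")
    y = torch.randn(1000, device="cuda")
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 128),)  # 8 program instances
    add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=128)
    assert torch.allclose(out, x + y)
```

The grid tuple tells the launcher how many instances to spawn; each one picks up its pallet of 128 elements and nothing else.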